different programming language
Code Vulnerability Detection Across Different Programming Languages with AI Models
Humran, Hael Abdulhakim Ali, Sonmez, Ferdi
Security vulnerabilities present in a code that has been written in diverse programming languages are among the most critical yet complicated aspects of source code to detect. Static analysis tools based on rule-based patterns usually do not work well at detecting the context-dependent bugs and lead to high false positive rates. Recent developments in artificial intelligence, specifically the use of transformer-based models like CodeBERT and CodeLlama, provide light to this problem, as they show potential in finding such flaws better. This paper presents the implementations of these models on various datasets of code vulnerability, showing how off-the-shelf models can successfully produce predictive capacity in models through dynamic fine-tuning of the models on vulnerable and safe code fragments. The methodology comprises the gathering of the dataset, normalization of the language, fine-tuning of the model, and incorporation of ensemble learning and explainable AI. Experiments show that a well-trained CodeBERT can be as good as or even better than some existing static analyzers in terms of accuracy greater than 97%. Further study has indicated that although language models can achieve close-to-perfect recall, the precision can decrease. A solution to this is given by hybrid models and validation procedures, which will reduce false positives. According to the results, the AI-based solutions generalize to different programming languages and classes of vulnerability. Nevertheless, robustness, interpretability, and deployment readiness are still being developed. The results illustrate the probabilities that AI will enhance the trustworthiness in the usability and scalability of machine-learning-based detectors of vulnerabilities.
CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation
Xiong, Zhanhang, Wang, Dongxia, Li, Yuekang, An, Xinyuan, Wang, Wenhai
--As large language models (LLMs) become increasingly embedded in software engineering workflows, a critical capability remains underexplored: generating correct code that enables cross-programming-language (CPL) interoperability. This skill is essential for building complex systems that integrate components written in multiple languages via mechanisms like inter-process communication (IPC). T o bridge this gap, we present CrossPL, the first benchmark designed to systematically evaluate LLMs' ability to generate CPL-interoperating code. CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used programming languages and seven representative CPL techniques. We construct this benchmark by (i) analyzing 19,169 multi-language GitHub repositories using 156 hand-crafted finite state machines (FSMs), and (ii) developing an LLM-based pipeline that automatically extracts CPL code snippets, generates task instructions, and validates functional correctness. We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation. Results reveal that even the best-performing models struggle with CPL scenarios, underscoring the need for more targeted research in this space. Our benchmark and code are available at: https://anonymous.4open.science/r/crosspl-2814/. In software system development, the use of multiple programming languages (MPL) has become increasingly common, as it allows developers to exploit the unique strengths of different languages to improve performance, modularity, and scalability [1]-[5].
Multi-Agent Collaboration for Multilingual Code Instruction Tuning
Yang, Jian, Zhang, Wei, Yang, Jiaxi, Miao, Yibo, Quan, Shanghaoran, Wu, Zhenhe, Peng, Qiyao, Yang, Liqun, Liu, Tianyu, Cui, Zeyu, Hui, Binyuan, Lin, Junyang
Recent advancement in code understanding and generation demonstrates that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most previous existing methods mainly view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language-specific instruction data from the code snippets and then provide the generated data as the seed data for language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (A new programming language or existing programming language), To further encourage the cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5-xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.
Green AI: Which Programming Language Consumes the Most?
Marini, Niccolรฒ, Pampaloni, Leonardo, Di Martino, Filippo, Verdecchia, Roberto, Vicario, Enrico
AI is demanding an evergrowing portion of environmental resources. Despite their potential impact on AI environmental sustainability, the role that programming languages play in AI (in)efficiency is to date still unknown. With this study, we aim to understand the impact that programming languages can have on AI environmental sustainability. To achieve our goal, we conduct a controlled empirical experiment by considering five programming languages (C++, Java, Python, MATLAB, and R), seven AI algorithms (KNN, SVC, AdaBoost, decision tree, logistic regression, naive bayses, and random forest), three popular datasets, and the training and inference phases. The collected results show that programming languages have a considerable impact on AI environmental sustainability. Compiled and semi-compiled languages (C++, Java) consistently consume less than interpreted languages (Python, MATLAB, R), which require up to 54x more energy. Some languages are cumulatively more efficient in training, while others in inference. Which programming language consumes the most highly depends on the algorithm considered. Ultimately, algorithm implementation might be the most determining factor in Green AI, regardless of the language used. As conclusion, while making AI more environmentally sustainable is paramount, a trade-off between energy efficiency and implementation ease should always be considered. Green AI can be achieved without the need of completely disrupting the development practices and technologies currently in place.
Multi-Programming Language Ensemble for Code Generation in Large Language Model
Xue, Tengfei, Li, Xuefeng, Azim, Tahir, Smirnov, Roman, Yu, Jianhui, Sadrieh, Arash, Pahlavan, Babak
Large language models (LLMs) have significantly improved code generation, particularly in one-pass code generation. However, most existing approaches focus solely on generating code in a single programming language, overlooking the potential of leveraging the multi-language capabilities of LLMs. LLMs have varying patterns of errors across different languages, suggesting that a more robust approach could be developed by leveraging these multi-language outputs. In this study, we propose Multi-Programming Language Ensemble (MPLE), a novel ensemble-based method that utilizes code generation across multiple programming languages to enhance overall performance. By treating each language-specific code generation process as an individual "weak expert" and effectively integrating their outputs, our method mitigates language-specific errors and biases. This multi-language ensemble strategy leverages the complementary strengths of different programming languages, enabling the model to produce more accurate and robust code. Our approach can be seamlessly integrated with commonly used techniques such as the reflection algorithm and Monte Carlo tree search to improve code generation quality further. Experimental results show that our framework consistently enhances baseline performance by up to 17.92% on existing benchmarks (HumanEval and HumanEval-plus), with a standout result of 96.25% accuracy on the HumanEval benchmark, achieving new state-of-the-art results across various LLM models. The code will be released at https://github.com/NinjaTech-AI/MPLE
CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation
Du, Kounianhua, Rui, Renting, Chai, Huacan, Fu, Lingyue, Xia, Wei, Wang, Yasheng, Tang, Ruiming, Yu, Yong, Zhang, Weinan
Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the general large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing among natural language and different programming languages. In addition, programming languages are inherently logical and complex, making them hard to be correctly generated. Existing methods rely on multiple prompts to the large language model to explore better solutions, which is expensive. In this paper, we propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks. CodeGRAG extracts and summarizes the control flow and data flow of code blocks to fill the gap between programming languages and natural language. The extracted external structural knowledge models the inherent flows of code blocks, which can facilitate LLMs for better understanding of code syntax and serve as a bridge among different programming languages. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation, e.g., C++ for Python.
Poisoning Programs by Un-Repairing Code: Security Concerns of AI-generated Code
AI-based code generators have gained a fundamental role in assisting developers in writing software starting from natural language (NL). However, since these large language models are trained on massive volumes of data collected from unreliable online sources (e.g., GitHub, Hugging Face), AI models become an easy target for data poisoning attacks, in which an attacker corrupts the training data by injecting a small amount of poison into it, i.e., astutely crafted malicious samples. In this position paper, we address the security of AI code generators by identifying a novel data poisoning attack that results in the generation of vulnerable code. Next, we devise an extensive evaluation of how these attacks impact state-of-the-art models for code generation. Lastly, we discuss potential solutions to overcome this threat.
Can Programming Languages Boost Each Other via Instruction Tuning?
Zan, Daoguang, Yu, Ailun, Shen, Bo, Zhang, Jiaxin, Chen, Taihong, Geng, Bing, Chen, Bei, Ji, Jichuan, Yao, Yafen, Wang, Yongji, Wang, Qianxiang
When human programmers have mastered a programming language, it would be easier when they learn a new programming language. In this report, we focus on exploring whether programming languages can boost each other during the instruction fine-tuning phase of code large language models. We conduct extensive experiments of 8 popular programming languages (Python, JavaScript, TypeScript, C, C++, Java, Go, HTML) on StarCoder. Results demonstrate that programming languages can significantly improve each other. For example, CodeM-Python 15B trained on Python is able to increase Java by an absolute 17.95% pass@1 on HumanEval-X. More surprisingly, we found that CodeM-HTML 7B trained on the HTML corpus can improve Java by an absolute 15.24% pass@1. Our training data is released at https://github.com/NL2Code/CodeM.
CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
Sorokin, Nikita, Abulkhanov, Dmitry, Nikolenko, Sergey, Malykh, Valentin
We consider the clone detection and information retrieval problems for source code, well-known tasks important for any programming language. Although it is also an important and interesting problem to find code snippets that operate identically but are written in different programming languages, to the best of our knowledge multilingual clone detection has not been studied in literature. In this work, we formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset. Moreover, we present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages. The resulting CCT-LM model, initialized with GraphCodeBERT and fine-tuned with CCT, achieves new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67\% MAP and AdvTest code search benchmark with 47.18\% MRR; it also shows the best results on the newly created multilingual clone detection benchmark XCD across all programming languages.
OpenAI, GPT, ChatGPT and DALL-E Masterclass - BUKMEDIANET
Created by Raymond Davey 4.5 hours on-demand video course Ever since OpenAI arrived on the scene, access to a trained AI has become accessible to everyone. GPT allows you to ask a chatbot to complete tasks, and to answer questions. Fine-tuning allows you to change the way the AI responds. Embedding allows you to use your own knowledge base. Dall-E allows you to generate images from text.